The System and the Study

To evaluate the impact of Omitted Variable Bias (OVB) on different models, consider a system where oceanography drives both site temperature and recruitment over time. Temperature also fluctuates from year to year within each site. Recruitment and temperature influence snail abundance, as do other uncorrelated drivers. Suppose you have conducted a study in this system, measuring 10 sites, each sampled once per year over 10 years. In this study, you have recorded snail abundance and temperature, but you have no measure of recruitment. Note that the model results would be the same if you had instead sampled 10 sites in a single year with 10 plots per site, provided there were plot-level drivers of temperature that behaved the same way as the year-to-year drivers below.

To parameterize our simulations, we will use the default values in the functions below. Thus, the system looks like this:
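One way to sketch this causal structure is as a DAG, e.g. with DiagrammeR (which we load below). This is an illustrative sketch only; the node names are informal shorthand, not from any package:

```r
# A rough DAG of the simulated system: oceanography drives both site
# temperature and recruitment, and both of those drive snail abundance.
DiagrammeR::grViz("
digraph system {
  Oceanography -> Temperature
  Oceanography -> Recruitment
  Temperature  -> Snails
  Recruitment  -> Snails
  Other_drivers -> Snails
}
")
```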

Functions to Create the System

To simulate data, let’s begin by loading some libraries:

library(tidyverse)
library(lme4)
library(broom)
library(broom.mixed)
library(DiagrammeR)
library(glue)

Next, we need a function that will create a template of simulated sites based on oceanography and the sampling design described above.

make_environment <- function(n_sites = 10,
                             ocean_temp = 2,         # effect of oceanography on temperature
                             temp_sd = 0,            # site-level temperature noise
                             ocean_recruitment = -2, # effect of oceanography on recruitment
                             recruitment_sd = 0,     # site-level recruitment noise
                             temp_mean = 15,
                             rec_mean = 100,
                             seed = NULL){
  
  if(!is.null(seed)) set.seed(seed)
  
  # oceanography drives both site temperature and site recruitment,
  # so temperature and the (unmeasured) recruitment are correlated
  tibble(
    site = as.character(1:n_sites)) %>%
    mutate(
      oceanography = rnorm(n_sites),
      site_temp = temp_mean + 
        rnorm(n_sites, ocean_temp * oceanography, temp_sd),
      site_recruitment = rec_mean +
        rnorm(n_sites, ocean_recruitment * oceanography, recruitment_sd)
      
    )
}
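As a quick, self-contained base-R sketch of what this function sets up (using the default parameter values above): with temp_sd and recruitment_sd both 0, site temperature and recruitment are perfectly negatively correlated through oceanography.

```r
set.seed(2)
oceanography <- rnorm(10)                    # n_sites = 10
site_temp <- 15 + 2 * oceanography           # temp_mean + ocean_temp * oceanography
site_recruitment <- 100 + -2 * oceanography  # rec_mean + ocean_recruitment * oceanography
cor(site_temp, site_recruitment)             # -1: no noise, so perfectly confounded
```

Any site-level noise (temp_sd or recruitment_sd above 0) weakens this correlation, but as long as it is nonzero, recruitment confounds the temperature effect.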

Great. Now, we need to add that year-to-year or plot-to-plot variability.

make_plots <- function(sites_df,
                       n_plots_per_site = 10,
                       plot_temp_sd = 1,        # within-site (year-to-year) temperature SD
                       temp_effect = 1,         # true effect of temperature on snails
                       recruitment_effect = 1,  # true effect of recruitment on snails
                       sd_plot = 3,             # residual SD of snail abundance
                       seed = NULL){
  
  if(!is.null(seed)) set.seed(seed)
  
  sites_df %>%
    rowwise() %>%
    # draw each site's yearly temperature deviations
    mutate(
      plot_temp_dev_actual = list(rnorm(n_plots_per_site,
                                        0, plot_temp_sd))) %>%
    unnest(plot_temp_dev_actual) %>%
    mutate(
      plot_temp = site_temp + plot_temp_dev_actual,
      # snail abundance responds to temperature and (unmeasured) recruitment
      snails = rnorm(n(),
                     temp_effect*plot_temp +
                       recruitment_effect*site_recruitment,
                     sd_plot)) %>%
    ungroup() %>%
    group_by(site) %>%
    # derived columns for the centered and panel models below
    mutate(year = 1:n(),
           site_mean_temp = mean(plot_temp),
           plot_temp_dev = plot_temp - site_mean_temp,
           site_mean_snail = mean(snails),
           site_snail_dev = snails - site_mean_snail,
           delta_snails = snails - lag(snails),
           delta_temp = plot_temp - lag(plot_temp)) %>%
    ungroup()
  
}
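The derived columns at the end do the bookkeeping for the centered and panel models. A tiny base-R illustration of what two of them compute, using made-up temperatures for a single site:

```r
x <- c(14, 15, 16)  # one site's plot temperatures over three years
x - mean(x)         # plot_temp_dev: within-site deviations -> -1 0 1
diff(x)             # delta_temp: year-to-year first differences -> 1 1
```

Both transformations strip out the site-level mean, which is exactly where the confounding with recruitment lives.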

To analyze our data, we will compare the fits of several different models.

analyze_plots <- function(plot_df){
  
  m <- tribble(
    ~model_type, ~fit,
    "Naive", lm(snails ~ plot_temp, data = plot_df),
    "RE", lmer(snails ~ plot_temp + (1|site), data = plot_df),
    "FE", lm(snails ~ plot_temp + site, data = plot_df),
    "Group Mean Covariate", lmer(snails ~ plot_temp + site_mean_temp + (1|site), data = plot_df),
    "Group Mean Centered", lmer(snails ~ plot_temp_dev + site_mean_temp + (1|site), data = plot_df),
    "Panel", lm(delta_snails ~ delta_temp, data = plot_df)
  ) %>%
    mutate(coefs = map(fit, tidy),
           temp_effect = map(coefs, get_temp_coef),
           model_type = fct_inorder(model_type))
  
  m
}

get_temp_coef <- function(a_tidy_frame){
  # pull the temperature slope, whichever term name it has in a given model
  a_tidy_frame %>%
    filter(term %in% c("plot_temp", "plot_temp_dev", "delta_temp")) %>%
    select(estimate, std.error)
}
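Before running the full simulation, here is a self-contained base-R sketch (with parameters matching the simulation defaults, and variable names of my own) of why the naive and fixed-effects fits should differ:

```r
set.seed(1)
n_sites <- 10; n_years <- 10
site <- factor(rep(1:n_sites, each = n_years))
ocean <- rnorm(n_sites)
plot_temp <- 15 + 2 * ocean[site] + rnorm(n_sites * n_years)          # site + yearly temp
recruitment <- 100 - 2 * ocean[site]                                  # unmeasured confounder
snails <- plot_temp + recruitment + rnorm(n_sites * n_years, sd = 3)  # true slope = 1
coef(lm(snails ~ plot_temp))["plot_temp"]         # naive: biased well below 1
coef(lm(snails ~ plot_temp + site))["plot_temp"]  # fixed effects: near 1
```

The site dummies absorb the site-level recruitment differences, so the FE slope is estimated from within-site variation only.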

Simulations and Results

Let’s begin by setting up 100 replicate simulations.

set.seed(31415)
n_sims <- 100

envt <- tibble(
  sims = 1:n_sims
) %>%
  mutate(sites = map(sims, ~make_environment())) 

Just for a sanity check, here’s the relationship between temperature and recruitment at the site level across all simulations.

ggplot(envt %>%
         unnest(sites),
       aes(x = site_temp, y = site_recruitment, group = sims)) +
  geom_point(alpha = 0.4, color = 'grey') +
  stat_smooth(method = "lm", size = 0.5, alpha = 0.5,
              fill = NA, color = "black")
## `geom_smooth()` using formula 'y ~ x'

Great! Now, let’s set up our sampling over time.

plots_df <- envt %>%
  mutate(site_year = map(sites, make_plots))

And, again, a sanity check…

ggplot(plots_df %>%
         unnest(site_year),
       aes(x = plot_temp, y = snails, color = site)) +
  geom_point(alpha = 0.5)

So we can see that in this setup, the snail–temperature relationship is positive. But how positive is it? What would our coefficients show?

Let’s fit models to each set of simulated data.

analysis_df <- plots_df %>%
  mutate(analysis = map(site_year, analyze_plots)) %>%
  unnest(analysis)

And now let’s look at the distribution of coefficients describing the relationship between temperature and snails from each model.

analysis_df %>%
  unnest(temp_effect) %>%
  ggplot(aes(y = fct_rev(model_type), x = estimate)) +
  ggridges::stat_density_ridges() +
  labs(y="")
## Picking joint bandwidth of 0.0998

Eyeballing it, we can see that, as expected, the naive model’s estimate is biased low. How bad is the bias for the RE model?

analysis_df %>%
  unnest(temp_effect) %>%
  group_by(model_type) %>%
  summarize(mean_estimate = mean(estimate),
            sd_estimate = sd(estimate))
## # A tibble: 6 × 3
##   model_type           mean_estimate sd_estimate
##   <fct>                        <dbl>       <dbl>
## 1 Naive                        0.230       0.160
## 2 RE                           0.297       0.201
## 3 FE                           0.989       0.326
## 4 Group Mean Covariate         0.989       0.326
## 5 Group Mean Centered          0.989       0.326
## 6 Panel                        0.966       0.389
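As a back-of-the-envelope check on the naive estimate (using the default parameters above: ocean_temp = 2, ocean_recruitment = -2, plot_temp_sd = 1, and both true effects equal to 1), the expected naive slope is Cov(plot_temp, snails) / Var(plot_temp):

```r
var_temp <- 2^2 * 1 + 1^2                # ocean_temp^2 * Var(oceanography) + plot_temp_sd^2
cov_ts   <- 1 * var_temp + 1 * (2 * -2)  # temperature path + confounding via recruitment
cov_ts / var_temp                        # expected naive slope: 0.2, close to the simulated 0.23
```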

The downward bias produced by poor model choice is clear: the naive and RE estimates sit well below the true effect of 1, while the FE, group mean, and panel approaches all recover it.